# Lightweight Vision-Language Model

## SmolDocling 256M Preview (MLX bf16, Docling snap)

*ds4sd · Image-to-Text · Transformers · English · 246 downloads · 1 like*

A 256M-parameter preview of a document-understanding model, designed for document structure parsing and content extraction; it converts document images into structured data.
## Qwen2.5-VL-7B-Captioner-Relaxed GGUF

*samgreen · Apache-2.0 · Image-to-Text · English · 320 downloads · 1 like*

Qwen2.5-VL-7B-Captioner-Relaxed is a multimodal vision-language model based on the Qwen2.5 architecture, focused on image-to-text generation; distributed here in GGUF format.
## SmolVLM2 500M Video Instruct (MLX 8bit, skip-vision)

*mlx-community · Apache-2.0 · Image-to-Text · Transformers · English · 51 downloads · 2 likes*

An MLX-format model converted from SmolVLM2-500M-Video-Instruct, supporting video-to-text tasks.
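The "8bit" in the entry above refers to weight quantization: each float weight is stored as an 8-bit integer plus per-tensor scale and offset, roughly quartering memory versus fp32. As a minimal illustrative sketch (affine quantization in plain Python, not MLX's exact grouped scheme):

```python
def quantize_8bit(weights):
    """Affine 8-bit quantization: map floats onto integer codes 0..255."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_8bit(codes, scale, lo):
    """Recover approximate float weights from the 8-bit codes."""
    return [c * scale + lo for c in codes]

weights = [-0.52, -0.13, 0.0, 0.27, 0.98]
codes, scale, lo = quantize_8bit(weights)
approx = dequantize_8bit(codes, scale, lo)

# Reconstruction error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, approx))
assert max_err <= scale / 2 + 1e-9
```

Real runtimes quantize per group of weights (and keep activations in higher precision), but the trade-off is the same: smaller files and faster memory-bound inference for a bounded loss of precision.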
## SmolVLM2 500M Video Instruct (MLX)

*mlx-community · Apache-2.0 · Image-to-Text · Transformers · English · 2,491 downloads · 12 likes*

A video-text-to-text model in MLX format, developed by HuggingFaceTB, supporting English-language processing.
## LlavaGuard v1.2 0.5B OV

*AIML-TUDA · Image-to-Text · 239 downloads · 2 likes*

LlavaGuard is a safety guard built on vision-language models, used primarily for safety classification and violation detection on image content.
## Doubutsu 2B PT 756

*qresearch · Apache-2.0 · Image-to-Text · Transformers · English · 129 downloads · 3 likes*

Doubutsu is a series of lightweight vision-language models designed for fine-tuning on custom scenarios.
## Cerule v0.1

*Tensoic · Image-to-Text · Transformers · English · 157 downloads · 47 likes*

Cerule is a lightweight yet capable vision-language model built on Google's Gemma-2b and SigLIP, focused on image-text processing.
## UForm-Gen2-dpo

*unum-cloud · Apache-2.0 · Image-to-Text · Transformers · English · 3,568 downloads · 44 likes*

UForm-Gen2-dpo is a small generative vision-language model aligned for image captioning and visual question answering via Direct Preference Optimization (DPO) on the VLFeedback and LLaVA-Human-Preference-10K preference datasets.
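DPO, the alignment method named above, trains directly on preference pairs: it raises the log-probability of the preferred caption relative to the rejected one, measured against a frozen reference model, with no separate reward model. A minimal sketch of the loss on one preference pair, assuming per-sequence log-probabilities are already computed (function and argument names here are illustrative, not UForm's API):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    The margin compares how much more the policy favors the chosen
    response over the rejected one, relative to the frozen reference;
    beta controls how sharply deviations from the reference are scored.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# If the policy already prefers the chosen caption more strongly than
# the reference does, the margin is positive and the loss drops below
# log(2); the loss at margin 0 is exactly log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
```

Minimizing this over a preference dataset (here, VLFeedback and LLaVA-Human-Preference-10K) nudges the policy toward human-preferred captions while the reference term keeps it from drifting too far from the base model.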
## Moondream Prompt

*gokaygokay · Apache-2.0 · Image-to-Text · Transformers · 162 downloads · 10 likes*

A fine-tuned version of Moondream2 optimized for image prompt generation; a lightweight vision-language model suited to efficient inference on edge devices.
## LLaVa-Phi-2-3B

*marianna13 · MIT · Image-to-Text · Transformers · English · 153 downloads · 13 likes*

LLaVa-Phi-2-3B is an open-source multimodal chatbot model fine-tuned on the Phi-2 architecture, processing image and text inputs to generate natural-language responses.
© 2025 AIbase